
    RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning

    The technology-push of die stacking and application-pull of Big Data machine learning (BDML) have created a unique opportunity for processing-near-memory (PNM). This paper makes four contributions: (1) While previous PNM work explores general MapReduce workloads, we identify three workload characteristics: (a) irregular-and-compute-light (i.e., perform only a few operations per input word, which include data-dependent branches and indirect memory accesses); (b) compact (i.e., the computation has little intermediate live data and uses only a small amount of contiguous input data); and (c) memory-row-dense (i.e., process the input data without skipping over many bytes). We show that BDMLs have, or can be transformed to have, these characteristics which, except for irregularity, are necessary for bandwidth- and energy-efficient PNM, irrespective of the architecture. (2) Based on these characteristics, we propose RowCore, a row-oriented PNM architecture, which (pre)fetches and operates on entire memory rows to exploit BDMLs' row-density. In contrast to this row-centric access and compute schedule, traditional architectures only opportunistically improve row locality while fetching and operating on cache blocks. (3) RowCore employs well-known MIMD execution to handle BDMLs' irregularity, and sequential prefetch of input data to hide memory latency. In RowCore, however, one corelet prefetches a row for all the corelets, which may stray far from each other due to their MIMD execution. Consequently, a leading corelet may prematurely evict the prefetched data before a lagging corelet has consumed it. RowCore employs novel cross-corelet flow control to prevent such eviction. (4) RowCore further exploits its flow-controlled prefetch for frequency scaling based on novel coarse-grain compute-memory rate matching, which decreases (increases) the processor clock speed when the prefetch buffers are empty (full). Using simulations, we show that RowCore improves performance and energy by 135% and 20% over a GPGPU with prefetch, and by 35% and 34% over a multicore with prefetch, when all three architectures use the same resources (i.e., number of cores and on-processor-die memory) and identical die stacking (i.e., GPGPUs/multicores/RowCore and DRAM).
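
    Below is a minimal sketch, in Python and with invented names and thresholds, of the coarse-grain compute-memory rate matching described above: the corelet clock is nudged down when the prefetch buffers run empty (compute is outpacing memory) and up when they run full.

        class RateMatcher:
            """Illustrative rate matcher; frequencies and thresholds are assumptions."""

            def __init__(self, f_min=0.5, f_max=2.0, step=0.1):
                self.freq = f_max                    # current corelet clock (GHz)
                self.f_min, self.f_max, self.step = f_min, f_max, step

            def update(self, buffered_rows, capacity):
                occupancy = buffered_rows / capacity
                if occupancy < 0.25:                 # buffers nearly empty: slow compute down
                    self.freq = max(self.f_min, self.freq - self.step)
                elif occupancy > 0.75:               # buffers nearly full: speed compute up
                    self.freq = min(self.f_max, self.freq + self.step)
                return self.freq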

    Architectural Support for Operating System-Driven CMP Cache Management

    The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [22]. Unfortunately, one key resource, the lower-level shared cache in chip multi-processors, is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under a "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third-party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed. But they offer no policy flexibility. This paper addresses this problem by designing architectural support for the OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface, and a set of OS-level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies, including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization, and (c) reactive performance differentiation.
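
    The following Python sketch, with assumed names, illustrates the split described above: a hardware-style quota check enforced on every cache allocation, plus a coarse-grain OS interface that only retunes quotas at scheduling intervals.

        class QuotaCache:
            """Illustrative shared-cache quota mechanism; not the paper's exact design."""

            def __init__(self, total_ways=16):
                self.total_ways = total_ways
                self.quota = {}                      # app_id -> maximum blocks/ways allowed
                self.used = {}                       # app_id -> blocks/ways currently held

            def set_quota(self, app_id, ways):       # OS interface, called at coarse grain
                self.quota[app_id] = ways

            def allocate(self, app_id):              # hardware-side check, on every miss
                if self.used.get(app_id, 0) < self.quota.get(app_id, 0):
                    self.used[app_id] = self.used.get(app_id, 0) + 1
                    return True                      # may take a new block
                return False                         # must replace within its own quota

    An OS policy such as miss-rate equalization would periodically compare per-application miss rates and call set_quota() to shift capacity toward the application with the higher miss rate.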

    SafeBet: Secure, Simple, and Fast Speculative Execution

    Spectre attacks exploit microprocessor speculative execution to read and transmit forbidden data outside the attacker's trust domain and sandbox. Recent hardware schemes allow potentially-unsafe speculative accesses but prevent the secret's transmission by delaying most access-dependent instructions even in the predominantly-common, no-attack case, which incurs performance loss and hardware complexity. Instead, we propose SafeBet, which allows only safe accesses, and does not delay most of them, achieving both security and high performance. SafeBet is based on the key observation that speculatively accessing a destination location is safe if the location's access by the same static trust domain has been committed previously, and potentially unsafe otherwise. We extend this observation to handle inter-trust-domain code and data interactions. SafeBet employs the Speculative Memory Access Control Table (SMACT) to track non-speculative trust-domain code region-destination pairs. Disallowed accesses wait until reaching commit to trigger well-known replay, with virtually no change to the pipeline. Software simulations using SpecCPU benchmarks show that SafeBet uses an 8.3-KB SMACT per core to perform within 6% on average (63% at worst) of the unsafe baseline, behind which NDA-restrictive, a previous scheme of security and hardware complexity comparable to SafeBet's, lags by 83% on average.
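
    A minimal Python sketch of the SMACT idea, with invented names: a speculative load may proceed only if the same static trust domain's code region has previously committed a non-speculative access to the destination region; otherwise it waits until commit.

        class SMACT:
            """Illustrative access-control table; fields and granularity are assumptions."""

            def __init__(self):
                self.committed = set()               # {(trust_domain, code_region, dest_region)}

            def record_commit(self, domain, code_region, dest_region):
                # Called when a non-speculative access commits.
                self.committed.add((domain, code_region, dest_region))

            def may_speculate(self, domain, code_region, dest_region):
                # Called at issue time for a speculative access; True means "safe".
                return (domain, code_region, dest_region) in self.committed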

    PUMA: Purdue MapReduce Benchmarks Suite


    Achieving Causal Consistency under Partial Replication for Geo-distributed Cloud Storage

    Causal consistency has emerged as an attractive middle ground for architecting cloud storage systems, as it allows for high availability and low latency while supporting stronger-than-eventual-consistency semantics. However, causally-consistent cloud storage systems have seen limited deployment in practice. A key factor is that these systems employ full replication of all the data in all the data centers (DCs), incurring high cost. A simple extension of current causal systems to support partial replication by clustering DCs into rings incurs availability and latency problems. We propose Karma, the first system to enable causal consistency for partitioned data stores while achieving the cost advantages of partial replication without the availability and latency problems of the simple extension. Our evaluation with 64 servers emulating 8 geo-distributed DCs shows that Karma (i) incurs much lower cost than a fully-replicated causal store (owing to the lower replication factor); and (ii) offers higher availability and better performance than the above partial-replication extension at similar costs.
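
    The Python sketch below (invented names, not Karma's actual protocol) shows the core invariant any causally-consistent replica must preserve, which partial replication makes harder: a remote write becomes visible at a data center only after every write it causally depends on is already visible there.

        class CausalReplica:
            """Illustrative causal-visibility check; dependency metadata is assumed."""

            def __init__(self):
                self.visible = {}                    # key -> latest visible version
                self.pending = []                    # remote writes awaiting dependencies

            def apply_remote(self, key, version, deps):
                # deps: {dep_key: required_version} captured at the writing client
                if all(self.visible.get(k, 0) >= v for k, v in deps.items()):
                    self.visible[key] = max(self.visible.get(key, 0), version)
                    self._drain_pending()
                else:
                    self.pending.append((key, version, deps))

            def _drain_pending(self):
                # Retry buffered writes whose dependencies may now be satisfied.
                retry, self.pending = self.pending, []
                for key, version, deps in retry:
                    self.apply_remote(key, version, deps)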

    A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

    Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximize network performance. This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (Wave-Front Arbiter), implemented in the SGI Spider switch. SPAA outperforms PIM and WFA because SPAA exhibits matching capabilities similar to PIM and WFA under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the network's adverse performance degradation from saturation at high network loads by prioritizing packets already in the network over new packets generated by caches or memory.
    Mukherjee, S.; Silla Jiménez, F.; Bannon, P.; Emer, J.; Lang, S.; Webb, D. (2002). A comparative study of arbitration algorithms for the Alpha 21364 pipelined router. ACM SIGPLAN Notices, 37(10), 223-234. doi:10.1145/605432.605421
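
    The Python sketch below is not the actual SPAA hardware; it only illustrates the matching problem the arbiter solves each cycle, pairing packets at input ports with free output ports under a simple rotating priority.

        def arbitrate(requests, num_outputs, start=0):
            """requests[i]: set of output ports input i wants; returns {input: output}."""
            free = set(range(num_outputs))
            grants = {}
            num_inputs = len(requests)
            for k in range(num_inputs):
                i = (start + k) % num_inputs         # rotating priority among inputs
                wanted = requests[i] & free
                if wanted:
                    out = min(wanted)                # grant the lowest-numbered free output
                    grants[i] = out
                    free.discard(out)
            return grants

        # Example: three inputs contending for two outputs, priority starting at input 1.
        # arbitrate([{0, 1}, {0}, {1}], num_outputs=2, start=1) -> {1: 0, 2: 1}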

    TxComm: Transforming Stream Communication for Load Balance, Efficiency, and Fault-tolerance in Networks-on-Chip

    Recent work has examined using application-specific knowledge of streaming communication to optimize network routing (for throughput/performance) and/or design (for simpler hardware). However, previous techniques have assumed that the communication streams are directly mapped to networks-on-chip. In contrast, this paper explores the use of communication transformations (TxComm) to achieve (1) higher throughput via better network load balance, (2) more efficient network utilization, and (3) better fault-tolerance, while retaining the communication semantics of the original streaming application. Specifically, we propose two transformations: stream fission and stream fusion. (While fission and fusion transformations have been applied to computation in streaming programs, we are the first to propose fission and fusion transformations for stream communication.) Stream fission splits streams of communication into multiple streams that may be routed over independent network paths to achieve better network load balance. Stream fusion targets multicast communication and fuses multiple streams to effectively capture the well-known benefits of tree-based multicast, which include more efficient link utilization. Both techniques can be integrated in an integer linear program formulation that executes at compile time. Another key component of TxComm is the use of free routing, which serves two key purposes. First, it boosts the performance of fission and fusion. Second, it enables application-specific fault-tolerance. Evaluations with a suite of StreamIt benchmarks show that TxComm achieves significant performance improvement over prior application-specific (non-transformed) routing techniques. On the fault-tolerance front, TxComm achieves similar performance as a fault-free base case even when as many as 10% of links are faulty.
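
    A minimal Python sketch, with invented names, of the stream-fission idea above: one logical stream is split round-robin across several network paths, and sequence numbers let the receiver restore the original order so the stream's communication semantics are preserved.

        import heapq
        from collections import deque

        def fission_send(stream, paths):
            """Tag each item with a sequence number and spread it over the paths."""
            for seq, item in enumerate(stream):
                paths[seq % len(paths)].append((seq, item))

        def reassemble(paths):
            """Merge the per-path queues back into the original stream order."""
            merged = heapq.merge(*[list(p) for p in paths])   # items are (seq, value)
            return [item for _, item in merged]

        # Example with three queues standing in for independent network-on-chip routes.
        paths = [deque(), deque(), deque()]
        fission_send(range(10), paths)
        assert reassemble(paths) == list(range(10))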